Overview

Today, we’ll explore Russians’ support for the war in Ukraine using a public opinion survey conducted in Russia by Alexei Miniailo’s “Do Russians Want War” project.

We will look at how support for the war varies with the demographic predictors age, sex and education. We will see how multiple regression can be used to describe more complex relationships. From our baseline model, we will ask:

  • Does the relationship between age and support for the war vary between male and female respondents? (Interaction model)

  • Does the relationship between age and support vary at different levels of age? (Polynomial regression model)

  • Does the relationship between age and support vary at different levels of age, and does the nature of this variation differ between male and female respondents? (Polynomial regression model with an interaction term)

To accomplish this, we will do the following:

  1. Get set up to work and describe our data (10 minutes)

  2. Get practice interpreting long chunks of code with lots of %>%s (10 minutes)

  3. Estimate four models of increasing complexity (10 minutes)

  4. Present these models in a regression table and interpret the results (10 minutes)

  5. Evaluate the relative performance of these models in terms of their variance explained \(R^2\)’s (15 minutes)

  6. Produce predicted values to help us interpret and compare our baseline model to a model where the “effect” of an increase in age changes with age (15 minutes)

  7. Produce predicted values to help us interpret and compare models where the “effect” of age is allowed to vary with respondent’s sex (10 minutes)

Finally, if there’s time, we will:

  1. Explore additional questions of our choosing in the data

One of questions 1-7 will be randomly selected as the graded question for the lab.

set.seed(3172022)
graded_question <- sample(1:7,size = 1)
paste("Question",graded_question,"is the graded question for this week")
[1] "Question 5 is the graded question for this week"

You will work in your assigned groups. Only one member of each group needs to submit the html file produced by knitting the lab.

This lab must contain the names of the group members in attendance.

If you are attending remotely, you will submit your labs individually.

You can find your assigned groups in previous labs.

Goals

Broadly, our goals in this assignment are to:

  • Get comfortable estimating and interpreting regression models with interaction terms

    \[y = \beta_0 + \beta_1 x + \beta_2 z + \overbrace{\beta_3 x\cdot z}^{\text{Interaction}} \]
    • Including an interaction between two variables is useful if we think the “effect” of one variable depends on the value of another variable.

    • Today, we test whether the association between support for the war and age differs between male and female respondents.

  • Get comfortable estimating and interpreting regression models with polynomial terms

    \[x^1 = x; \quad x^2 = x\cdot x; \quad x^n = \underbrace{x \cdot x \cdots x}_{n \text{ times}} \]
    • Polynomial terms are a way of incorporating non-linearity in our predictors.
      Our model, however, is still linear in the parameters \(\beta\). If our model included a nonlinear function of a parameter, say \(\theta^2\), it would be a non-linear regression.
    • If we think the relationship between a predictor and the outcome varies at different levels of the predictor, we can include one or more polynomial terms for that predictor.1
  • Question 1 gives you practice summarizing data, and demonstrates how to construct a nicely formatted table of descriptive statistics to assist in the process

  • Question 2 asks you to run the code from Question 1 sequentially to help reinforce just what is happening at each step of the process.

  • Question 3 asks you to fit four models. Before you estimate each, I’ll ask you to think through what you would expect to find in each model.

    • m1 is our baseline model, a multiple regression predicting support for the war as function of age, sex, and education

    • m2 extends m1 by asking whether the relationship between age and support varies by sex, using an interaction term between age and sex

    • m3 extends m1 by allowing the relationship between age and support to vary with age, by including a polynomial term, age^2

    • m4 extends m3 by asking whether the varying effects of age proposed in m3 might further vary by sex

  • Question 4 asks you to display and interpret the results of your regression analyses using a regression table

  • Question 5 asks you to use the concepts of \(R^2\) and adjusted \(R^2\) to evaluate the relative performance of these four models

  • Question 6 gives you practice producing and interpreting predicted values for models m1 and m3 by varying one predictor, age

  • Question 7 gives you practice producing and interpreting predicted values for models m2 and m4 by varying two predictors, age and sex

  • Question 8 gives you a chance to explore the data on your own and ask questions that interest you. If you get to Question 8, I will bring your group candy next week.
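
To make the interaction idea concrete before we touch the survey data, here is a minimal sketch with simulated data (all names and numbers below are made up for illustration): the coefficient on x:z estimates how much the slope of x differs between the two groups.

```r
# Simulated illustration of an interaction model (not the DRWW data)
set.seed(42)
n <- 500
x <- runif(n, 18, 80)            # a continuous predictor, like age
z <- rbinom(n, 1, 0.5)           # a binary predictor, like is_female
# True slopes: 0.01 when z == 0, and 0.01 + 0.02 = 0.03 when z == 1
y <- 0.2 + 0.01 * x + 0.1 * z + 0.02 * x * z + rnorm(n, 0, 0.5)

fit <- lm(y ~ x * z)             # x * z expands to x + z + x:z
coef(fit)[["x:z"]]               # estimated difference in slopes, near 0.02
```

Note that in `lm()`, writing `x * z` is shorthand for `x + z + x:z`, so the main effects are always included alongside the interaction.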

Workflow

Please knit this .Rmd file

As with every lab, you should:

  • Download the file
  • Save it in your course folder
  • Update the author: section of the YAML header to include the names of your group members in attendance.
  • Knit the document
  • Open the html file in your browser (Easier to read)
  • Write your code in the chunks provided
  • Comment out or delete any test code you do not need
  • Knit the document again after completing a section or chunk (Error checking)
  • Upload the final lab to Canvas.

1 Get set up to work

1.1 Load Packages

First, let’s load the libraries we’ll need for today.

the_packages <- c(
  ## R Markdown
  "kableExtra","DT","texreg","htmltools",
  "flair", # Comments only
  ## Tidyverse
  "tidyverse", "lubridate", "forcats", "haven", "labelled",
  ## Extensions for ggplot
  "ggmap","ggrepel", "ggridges", "ggthemes", "ggpubr", 
  "GGally", "scales", "dagitty", "ggdag", "ggforce",
  # Data 
  "COVID19","maps","mapdata","qss","tidycensus", "dataverse",
  # Analysis
  "DeclareDesign", "zoo"
)

# Define function to load packages
ipak <- function(pkg){
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}

ipak(the_packages)
   kableExtra            DT        texreg     htmltools         flair 
         TRUE          TRUE          TRUE          TRUE          TRUE 
    tidyverse     lubridate       forcats         haven      labelled 
         TRUE          TRUE          TRUE          TRUE          TRUE 
        ggmap       ggrepel      ggridges      ggthemes        ggpubr 
         TRUE          TRUE          TRUE          TRUE          TRUE 
       GGally        scales       dagitty         ggdag       ggforce 
         TRUE          TRUE          TRUE          TRUE          TRUE 
      COVID19          maps       mapdata           qss    tidycensus 
         TRUE          TRUE          TRUE          TRUE          TRUE 
    dataverse DeclareDesign           zoo 
         TRUE          TRUE          TRUE 

1.2 Load the data

Next, we’ll load the recoded data for the lab by sourcing a script called drww_english_recode.R to download the raw data and recode Cyrillic into English.

If you were to click on the link (you don’t need to, but it might be interesting), you’d see the source code, which:

  • Downloads from GitHub the raw data from the DRWW project and loads it into R using the foreign package’s read.spss() function because the data are saved as a .sav file.

  • Creates df_drww from raw_drww, giving the columns (variables) in raw_drww English translations in df_drww

  • Recodes the values of variable in df_drww into numeric (_n suffixes) and factor (_f suffixes) variables (with levels that are in English)

  • Constructs some additional variables, like total_social_media_use, which is the total number of social media sites a respondent reports using.

source("https://gist.githubusercontent.com/PaulTestaBrown/7565987b8eedc743fa5a57e451abed40/raw/601680d7b20cd1638d93c325156d4b000ea1f9cc/drww_english_recode.R")
  • After running this code, you should see the following in your R environment
ls()
[1] "df_drww"         "graded_question" "ipak"            "raw_drww"       
[5] "the_packages"    "wd"             
  • We’ll only be working with df_drww, so let’s remove (rm()) the raw data raw_drww
rm(raw_drww)
ls()
[1] "df_drww"         "graded_question" "ipak"            "the_packages"   
[5] "wd"             

1.3 Describe the data

As always, it’s important to get a high-level overview of our data when we first load it into R.

Below, we take a look at the first few values of every variable in the data

glimpse(df_drww)
Rows: 1,807
Columns: 42
$ sex                          <chr> "Male", "Male", "Female", "Female", "Male…
$ age                          <dbl> 99, 78, 73, 73, 69, 69, 59, 54, 49, 48, 4…
$ support_war                  <fct> "(НЕ ЗАЧИТЫВАТЬ) Затрудняюсь ответить", "…
$ trust_gov                    <fct> (НЕ ЗАЧИТЫВАТЬ) Затрудняюсь ответить, Дов…
$ employ_working               <dbl> 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,…
$ employ_student               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ employ_retired               <dbl> 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,…
$ employ_maternity_leave       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ employ_homemaker             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ employ_unemployed            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,…
$ employ_other_employment      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ other_employ_open_response   <chr> "                                        …
$ education                    <fct> (НЕ ЗАЧИТЫВАТЬ) Затрудняюсь ответить, Выс…
$ social_classmates            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ social_in_contact_with       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ social_facebook              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ social_instagram             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ social_twitter               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ social_telegram              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ social_whatsapp              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ social_viber                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
$ social_tiktok                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,…
$ social_other_social_networks <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ none_social                  <fct> НЕ ВЫБРАН, ВЫБРАН, НЕ ВЫБРАН, ВЫБРАН, НЕ …
$ dk_social                    <fct> ВЫБРАН, НЕ ВЫБРАН, НЕ ВЫБРАН, НЕ ВЫБРАН, …
$ other_social_open_response   <chr> "                                        …
$ other_social_group           <fct> NA, NA, Ютуб, NA, Ютуб, NA, NA, NA, NA, N…
$ geo_urban_rural              <fct> NA, NA, "город, поселок городского типа",…
$ geo_district                 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ weight                       <dbl> 0.000, 0.000, 1.208, 0.000, 1.231, 1.208,…
$ support_war_f                <fct> NA, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes…
$ support_war01                <dbl> NA, 1, 1, 1, 1, 1, 0, 1, 1, NA, 1, 1, 1, …
$ trust_gov_f                  <fct> NA, Partly, Largely, Fully, Partly, Fully…
$ trust_gov_n                  <dbl> NA, 1, 2, 3, 1, 3, 0, NA, 2, NA, NA, 0, 3…
$ education_f                  <fct> NA, College or some college, Vocational, …
$ education_n                  <dbl> NA, 4, 3, 4, 1, 3, 4, NA, 4, NA, 3, 3, 3,…
$ geo_urban_rural_f            <fct> NA, NA, Urban, NA, Rural, Urban, Rural, N…
$ geo_district_f               <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ social_youtube               <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ social_yandex                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ total_social_media_use       <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 9, 0, 1, 0, 1, 0,…
$ no_social_media              <dbl> 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1,…

In the first part of this lab, we’ll work with the following variables

  • support_war01

  • age

  • sex

  • education_n

In the code chunk below, I demonstrate how to create a data frame of summary statistics:

the_vars <- c("support_war01","age", "is_female", "education_n")

df_drww %>%
  mutate(
    is_female = ifelse(sex == "Female",1, 0)
  ) %>%
  select(all_of(the_vars)) %>%
  pivot_longer(
    cols = all_of(the_vars ),
    names_to = "Variable",
    values_to = "Value"
  ) %>%
  group_by(Variable) %>%
  summarise(
    `N obs` = n(),
    Missing = sum(is.na(Value)),
    Min = min(Value, na.rm = T),
    `25th perc` = quantile(Value, .25, na.rm=T),
    Mean = mean(Value, na.rm=T),
    Median = median(Value, na.rm = T),
    `75th perc` = quantile(Value, .75, na.rm=T),
    Max = max(Value, na.rm = T)
  ) %>%
  mutate(
    Variable = case_when(
      Variable == "age" ~ "Age",
      Variable == "education_n" ~ "Education",
      Variable == "is_female" ~ "Female",
      Variable == "support_war01" ~ "Support for War",
      
    ),
    Variable = factor(Variable, levels = c("Support for War","Age","Female","Education"))
    ) %>%
  arrange(Variable) ->  summary_table

summary_table
# A tibble: 4 × 9
  Variable     `N obs` Missing   Min `25th perc`   Mean Median `75th perc`   Max
  <fct>          <int>   <int> <dbl>       <dbl>  <dbl>  <dbl>       <dbl> <dbl>
1 Support for…    1807     334     0           0  0.720      1           1     1
2 Age             1807       0    18          34 46.6       45          60    99
3 Female          1807       0     0           0  0.470      0           1     1
4 Education       1807      13     1           3  3.17       3           4     5

Which we can then format into a table of summary statistics:

kable(summary_table,
      digits = 2) %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover")) %>%
  kableExtra::pack_rows(
    group_label = "Outcome",
    start_row = 1,
    end_row = 1
  )%>%
  kableExtra::pack_rows(
    group_label = "Predictors",
    start_row = 2,
    end_row = 4
  )
Variable N obs Missing Min 25th perc Mean Median 75th perc Max
Outcome
Support for War 1807 334 0 0 0.72 1 1 1
Predictors
Age 1807 0 18 34 46.65 45 60 99
Female 1807 0 0 0 0.47 0 1 1
Education 1807 13 1 3 3.17 3 4 5

Please use this table to describe a typical respondent to the survey YOUR DESCRIPTION HERE

The median respondent in this sample is about 45 years old and has completed some form of vocational school. About 47 percent of the respondents are female, and 72 percent of the respondents support Russia’s actions in Ukraine.2

table(df_drww$education_f)

         Primary school        Secondary school              Vocational 
                    235                     108                     589 
College or some college         Graduate degree 
                    847                      15 

2 Interpreting long chunks of code

The previous section contained a fairly long chunk of code.

I suspect you had little trouble interpreting the table of summary statistics produced by the second code chunk (summary_table), but may not have followed everything that was happening in the first code chunk (summary_stats).

In this section, I want you to practice interpreting large chunks of code with lots of %>%

  • Recall that the pipe command takes the output of a function and pipes it into the first argument of the function that comes after it.

  • Each function in summary_stats expects a data.frame as its first argument. It does something to this input data.frame and returns a new data.frame, which %>% then passes on to the next function.

To learn how to produce a table of summary statistics (a common task for empirical work), it’s helpful to understand what’s happening at each step of the process.
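
A tiny example (using the built-in mtcars data, not our survey) may make the pipe’s behavior concrete:

```r
library(dplyr)

# The pipe passes the left-hand value as the first argument of the
# right-hand function, so these two expressions are equivalent:
mean(c(1, 5, 9))        # 5
c(1, 5, 9) %>% mean()   # also 5

# In a chain, each step's data frame feeds the next:
mtcars %>%
  filter(cyl == 4) %>%                # keep only 4-cylinder cars
  summarise(avg_mpg = mean(mpg))      # then summarize what's left
```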

In the code chunk below, starting from df_drww %>%, please:

  • Copy and paste all the code up until the next %>%

  • Run this code in your console

  • Write a comment above this code in the code chunk

Running code line-by-line is a great way to understand what’s happening and to clarify what you understand and what you need to review.

There are about 8 or 9 steps of code to explain (depending on how you copy and paste). I’ve done Steps 1-3 below. Please take the next 10 minutes or so to fill in the rest.

Here’s my commented version of the code

# ---- Step 1 -----

# Create a vector of variables we want to summarize
the_vars <- c("support_war01","age", "is_female", "education_n")


# ---- Step 2 -----

# From df_drww, create a new variable is_female that is 1 when sex == "Female"
df_drww %>%
 mutate(                             
   is_female = ifelse(sex == "Female",1, 0)
 )


# ---- Step 3 -----

# Select just the columns we want to summarize
df_drww %>%
 mutate(
   is_female = ifelse(sex == "Female",1, 0)
 ) %>%
 select(all_of(the_vars))


# ---- Step 4 -----

# Pivot wide data into long data so that we can use group_by to summarize each variable in Variable
df_drww %>%
 mutate(
   is_female = ifelse(sex == "Female",1, 0)
 ) %>%
 select(all_of(the_vars)) %>%
 pivot_longer(
   cols = all_of(the_vars ),
   names_to = "Variable",
   values_to = "Value"
 )


# ---- Step 5 -----

# Tell R to summarize by the unique levels of the 'Variable' variable
df_drww %>%
 mutate(
   is_female = ifelse(sex == "Female",1, 0)
 ) %>%
 select(all_of(the_vars)) %>%
 pivot_longer(
   cols = all_of(the_vars ),
   names_to = "Variable",
   values_to = "Value"
 ) %>%
 group_by(Variable)


# ---- Step 6 -----

# Calculate summary statistics for each variable (age, is_female, etc.) in Variable
df_drww %>%
 mutate(
   is_female = ifelse(sex == "Female",1, 0)
 ) %>%
 select(all_of(the_vars)) %>%
 pivot_longer(
   cols = all_of(the_vars ),
   names_to = "Variable",
   values_to = "Value"
 ) %>%
 group_by(Variable) %>%
 summarise(
   `N obs` = n(),
   Missing = sum(is.na(Value)),
   Min = min(Value, na.rm = T),
   `25th percentile` = quantile(Value, .25, na.rm=T),
   Mean = mean(Value, na.rm=T),
   Median = median(Value, na.rm = T),
   `75th percentile` = quantile(Value, .75, na.rm=T),
   Max = max(Value, na.rm = T)
 )

# ---- Step 7 -----

# Use case_when to recode the values of Variable to have more informative Labels;
# Make Variable a factor with 'Support for War' as the first level
df_drww %>%
 mutate(
   is_female = ifelse(sex == "Female",1, 0)
 ) %>%
 select(all_of(the_vars)) %>%
 pivot_longer(
   cols = all_of(the_vars ),
   names_to = "Variable",
   values_to = "Value"
 ) %>%
 group_by(Variable) %>%
 summarise(
   `N obs` = n(),
   Missing = sum(is.na(Value)),
   Min = min(Value, na.rm = T),
   `25th percentile` = quantile(Value, .25, na.rm=T),
   Mean = mean(Value, na.rm=T),
   Median = median(Value, na.rm = T),
   `75th percentile` = quantile(Value, .75, na.rm=T),
   Max = max(Value, na.rm = T)
 ) %>%
 mutate(
   Variable = case_when(
     Variable == "age" ~ "Age",
     Variable == "education_n" ~ "Education (1 = Primary school, 5 = Graduate Degree)",
     Variable == "is_female" ~ "Female",
     Variable == "support_war01" ~ "Support for War",
     
   ),
   Variable = factor(Variable, levels = c("Support for War","Age","Female","Education (1 = Primary school, 5 = Graduate Degree)"))
   )

# ---- Step 8 -----

# Re-arrange the table by the levels of the factor `Variable`.
# This puts our outcome, 'Support for War' at the top of the table
df_drww %>%
 mutate(
   is_female = ifelse(sex == "Female",1, 0)
 ) %>%
 select(all_of(the_vars)) %>%
 pivot_longer(
   cols = all_of(the_vars ),
   names_to = "Variable",
   values_to = "Value"
 ) %>%
 group_by(Variable) %>%
 summarise(
   `N obs` = n(),
   Missing = sum(is.na(Value)),
   Min = min(Value, na.rm = T),
   `25th percentile` = quantile(Value, .25, na.rm=T),
   Mean = mean(Value, na.rm=T),
   Median = median(Value, na.rm = T),
   `75th percentile` = quantile(Value, .75, na.rm=T),
   Max = max(Value, na.rm = T)
 ) %>%
 mutate(
   Variable = case_when(
     Variable == "age" ~ "Age",
     Variable == "education_n" ~ "Education (1 = Primary school, 5 = Graduate Degree)",
     Variable == "is_female" ~ "Female",
     Variable == "support_war01" ~ "Support for War",
     
   ),
   Variable = factor(Variable, levels = c("Support for War","Age","Female","Education (1 = Primary school, 5 = Graduate Degree)"))
   ) %>%
 arrange(Variable)


# ---- Step 9 -----

# Save the output to summary_table and print summary_table in the console
df_drww %>%
 mutate(
   is_female = ifelse(sex == "Female",1, 0)
 ) %>%
 select(all_of(the_vars)) %>%
 pivot_longer(
   cols = all_of(the_vars ),
   names_to = "Variable",
   values_to = "Value"
 ) %>%
 group_by(Variable) %>%
 summarise(
   `N obs` = n(),
   Missing = sum(is.na(Value)),
   Min = min(Value, na.rm = T),
   `25th percentile` = quantile(Value, .25, na.rm=T),
   Mean = mean(Value, na.rm=T),
   Median = median(Value, na.rm = T),
   `75th percentile` = quantile(Value, .75, na.rm=T),
   Max = max(Value, na.rm = T)
 ) %>%
 mutate(
   Variable = case_when(
     Variable == "age" ~ "Age",
     Variable == "education_n" ~ "Education (1 = Primary school, 5 = Graduate Degree)",
     Variable == "is_female" ~ "Female",
     Variable == "support_war01" ~ "Support for War",
     
   ),
   Variable = factor(Variable, levels = c("Support for War","Age","Female","Education (1 = Primary school, 5 = Graduate Degree)"))
   ) %>%
 arrange(Variable) ->  summary_table

summary_table

I’ve made a lot of tables of summary statistics in my life, so I have an idea what the end results should look like. When I wrote the code for this section, I did it sequentially, looking at the output of each Step, and saying ok, now I need to do this. My thought process looked like this:

  • To summarize the distribution of sex, I first needed to create a numeric indicator, is_female, which will tell me the proportion of the sample identifying as female.

  • Then I selected just the columns of the data I wanted to summarize using the select() with my vector of variable names the_vars

  • Then I pivoted them from a wide data frame into a long dataframe using pivot_longer() so I could use group_by() to calculate various statistics for each of the four variables.

  • Then I recoded the variable names into something that looks like English (and not R code) using case_when().

  • I know that R can arrange() things by the levels of a factor, so I turned the Variable variable into a factor with Support for War as the first level, because I want my outcome variable at the top of the table.

  • Finally, when the resulting output looks like a data frame I want, I save it to summary_table

That seems like a lot when read all at once. If I had asked you to produce a table of summary statistics and given you no guidance, I suspect you might have been stuck. Again, you’ve never had to do this before. But now that you’ve seen how to do this, and walked through the code to understand what’s happening at each stage, hopefully, you could adapt this code for your group projects.

You will almost certainly encounter complications (portals of discovery) along the way. By running your code sequentially, you can see step-by-step what’s happening and ask: did my code produce the expected output? If yes, on to the next step. If no, why not?


3 Explore the demographic predictors of Russians’ Support for the War in Ukraine

Today, we will estimate four models to explore how Russian support for the war in Ukraine varies with demographic predictors.

  • m1 provides a baseline model that predicts support for the war as a linear function of age, sex, and education_n, measured as a numeric scale (1 = Primary School, 5 = Graduate Degree).

  • m2 is an interaction model that asks whether the relationship between age and support varies by sex

  • m3 is a polynomial regression that includes a term for “age-squared”, which allows the relationship between age and support to vary over different levels of age

  • m4 adds interaction terms to the polynomial regression from m3, essentially allowing for separate curves for male and female respondents.

Before you estimate these models, please answer the following:

In the baseline model, m1 what do you expect the sign of the coefficient for each predictor to be:

  • Age (Positive/Negative)

  • Sex (Positive/Negative)3

  • Education (Positive/Negative)

In the interaction model, m2

  • Do you think the relationship between age and support will vary by sex? (Yes/No)

  • If you said yes, do you think the coefficient on the interaction between age and sex will be positive or negative? (Positive/Negative)

In the polynomial model, m3

  • If the coefficient on age is positive and the coefficient on age^2 is positive, then as age increases, the increase in the predicted level of support will be (increasing/decreasing)

  • If the coefficient on age is positive and the coefficient on age^2 is negative, then as age increases, the increase in the predicted level of support will be (increasing/decreasing)

In m4

  • If the coefficients on the interaction terms between the sex variable and the age variables (age and age^2) are statistically significant, this implies that the relationship between age and support for the war is (similar/different) for male and female respondents.

Uncomment the code below, and replace the ??? with the appropriate terms to fit the following models.

# # Baseline Model
# m1 <- lm(support_war01 ~ age + sex + education_n, df_drww)
# 
# # Interaction model: Allow coefficient for age to vary with sex
# m2 <- lm(support_war01 ~ age*??? + education_n, df_drww)
# 
# # Polynomial model: Allow coefficient for age to vary by age
# m3 <- lm(support_war01 ~ age + I(???^2) + sex + education_n, df_drww)
# 
# # Separate Polynomial: Allow coefficient for age to vary by age separately  by sex
# m4 <- lm(support_war01 ~ (age + I(???))*??? + education_n, df_drww)
# Baseline Model
m1 <- lm(support_war01 ~ age + sex + education_n, df_drww)

# Interaction model: Allow coefficient for age to vary with sex
m2 <- lm(support_war01 ~ age*sex + education_n, df_drww)

# Polynomial model: Allow coefficient for age to vary by age
m3 <- lm(support_war01 ~ age + I(age^2) + sex + education_n, df_drww)

# Separate Polynomial: Allow coefficient for age to vary by age separately  by sex
m4 <- lm(support_war01 ~ (age + I(age^2))*sex + education_n, df_drww)
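
One way to build intuition for m3 before interpreting it: in a model with age and I(age^2), the slope is not constant. Differentiating \(\hat{y} = \beta_0 + \beta_1 \text{age} + \beta_2 \text{age}^2\) with respect to age gives \(\beta_1 + 2\beta_2\,\text{age}\). The sketch below uses simulated data, so the numbers are illustrative, not estimates from df_drww:

```r
# Simulated data with a concave relationship between age and support
set.seed(7)
age <- runif(1000, 18, 90)
support <- 0.1 + 0.03 * age - 0.0003 * age^2 + rnorm(1000, 0, 0.3)

fit <- lm(support ~ age + I(age^2))
b <- coef(fit)

# Marginal effect of age at a given age: b_age + 2 * b_age2 * age
marginal_effect <- function(a) b[["age"]] + 2 * b[["I(age^2)"]] * a

marginal_effect(25)   # steeper slope at younger ages
marginal_effect(75)   # flatter slope at older ages
```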

4 Display your models in a regression table and offer some initial interpretations

Using code from previous labs and lectures, please display the models from the previous section in a regression table and offer some initial interpretations.

texreg::htmlreg(list(m1,m2,m3,m4)) %>% HTML %>% browsable()
Statistical models
  Model 1 Model 2 Model 3 Model 4
(Intercept) 0.28*** 0.27*** -0.12 -0.12
  (0.05) (0.06) (0.10) (0.14)
age 0.01*** 0.01*** 0.03*** 0.03***
  (0.00) (0.00) (0.00) (0.01)
sexMale 0.09*** 0.11 0.09*** 0.08
  (0.02) (0.07) (0.02) (0.20)
education_n -0.02 -0.02 -0.02* -0.02*
  (0.01) (0.01) (0.01) (0.01)
age:sexMale   -0.00   0.00
    (0.00)   (0.01)
age^2     -0.00*** -0.00**
      (0.00) (0.00)
age^2:sexMale       -0.00
        (0.00)
R2 0.11 0.11 0.12 0.12
Adj. R2 0.11 0.10 0.12 0.12
Num. obs. 1463 1463 1463 1463
***p < 0.001; **p < 0.01; *p < 0.05

In particular, please answer the following:

For m1 (Model 1), how do predicted levels of support for the war change with

  • Age
  • Sex
  • Education

For m2 (Model 2), does the relationship between age and support appear to differ for male and female respondents?

For m3 (Model 3), does the relationship between age and support appear to be constant, or does the predicted marginal effect of an increase in age differ?4

  • (The marginal effect of age is constant / The marginal effect of age varies)

For m4 (Model 4) do the varying marginal effects of age appear to also vary by gender?5

  • (Yes/No)

5 Use \(R^2\) to compare models

Now take a look at the bottom rows in your regression table which show each model’s \(R^2\) and adjusted \(R^2\).

Recall from class that \(R^2\) describes the proportion of the variance in our outcome (support for the war) explained by the predictors in our model (age, sex, education, and some interactions/polynomials).

You may also remember that \(R^2\) always increases as we add more predictors to the model. To account for this, we often look at the adjusted \(R^2\) which weights the increase in variance explained (decrease in variance unexplained) by the number of additional predictors needed to produce that increase.
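
These quantities are easy to compute from their definitions. A quick sketch, using the built-in mtcars data rather than our survey, checks the hand calculation against what summary.lm() reports:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

ssr <- sum(resid(fit)^2)                        # unexplained variation
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation
n <- nrow(mtcars)                               # number of observations
k <- length(coef(fit)) - 1                      # predictors, excluding intercept

r2 <- 1 - ssr / sst
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

c(r2, summary(fit)$r.squared)                   # these match
c(adj_r2, summary(fit)$adj.r.squared)           # and so do these
```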

Your regression table only reports results to 2 decimal places.

Let’s use the summary() function to extract and save the \(R^2\) (r.squared) and adjusted \(R^2\) (adj.r.squared) for each model. The code for m1 is there as an example.

# m1
m1_r_squared <- summary(m1)$r.squared
m1_adj_r_squared <- summary(m1)$adj.r.squared

# m2
m2_r_squared <- summary(m2)$r.squared
m2_adj_r_squared <- summary(m2)$adj.r.squared

# m3
m3_r_squared <- summary(m3)$r.squared
m3_adj_r_squared <- summary(m3)$adj.r.squared

# m4
m4_r_squared <- summary(m4)$r.squared
m4_adj_r_squared <- summary(m4)$adj.r.squared

m2_adj_r_squared - m1_adj_r_squared
[1] -0.0005784547
m3_adj_r_squared - m1_adj_r_squared
[1] 0.01169598
m4_adj_r_squared - m3_adj_r_squared
[1] -0.0009559246

Please answer the following:

  • How does the \(R^2\) change from m1 to m4 The \(R^2\) increases from 0.1073749 in m1 to 0.1199063 in m4

  • How does the adjusted \(R^2\) change from m1 to m4? The adjusted \(R^2\) actually decreases from m1 to m2, suggesting that allowing the relationship between age and support to vary by sex does not improve our model’s predictions. The adjusted \(R^2\) increases from m1 to m3 by 0.011696, suggesting that modeling age with a polynomial term provides a better fit to the data. Finally, the adjusted \(R^2\) decreases from m3 to m4 by about 0.00096, suggesting that fitting separate polynomials by sex is unnecessary.

  • Overall, which model should we prefer? m3 appears to provide the best fit to the data.

In practice, we can compare the relative performance of nested models using an F-Test from an Analysis of Variance.

Broadly, an F-test tests the hypothesis that the additional coefficients in the larger model are jointly 0. To test this claim, we calculate an F-statistic:

\[ F = \frac{(SSR_1 - SSR_2)/(df_1 -df_2)}{SSR_2/df_2}\]

In words:

  • \((SSR_1 - SSR_2)\) is the difference between the sum of squared residuals of the smaller model \((SSR_1)\) and that of the larger model \((SSR_2)\)
  • \((df_1 - df_2)\) is the difference in the “degrees of freedom” (total number of observations - total number of predictors), which equals the number of additional predictors in the larger model.
  • \(SSR_2/df_2\) is the sum of squared residuals of the larger model divided by its degrees of freedom.

This statistic follows an F-distribution. If the added predictors don’t produce big reductions in the sum of squared residuals, this statistic will be close to 0. Likewise, if the added predictors explain a lot of additional variation, this statistic will be fairly large. The larger the statistic, the less likely it is that the coefficients in the larger model are jointly 0.

The code chunk below shows how we can formally test each of our models. Comparing m1 to m2 (which adds an interaction term), we see the reduction in unexplained variance (SSR) is small. In short, m2 doesn’t seem to be an improvement over m1.

anova_m1_vs_m2 <- anova(m1,m2)
anova_m1_vs_m2
Analysis of Variance Table

Model 1: support_war01 ~ age + sex + education_n
Model 2: support_war01 ~ age * sex + education_n
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1   1459 263.41                           
2   1458 263.40  1  0.010309 0.0571 0.8112
# ---- Calculate F Stat by hand ----
## Residual Sums of Squares
anova_m1_vs_m2$RSS
[1] 263.4129 263.4026
sum(resid(m1)^2)
[1] 263.4129
sum(resid(m2)^2)
[1] 263.4026
# m restrictions (how many additional coefficients in m2)
anova_m1_vs_m2$Df
[1] NA  1
m_m2 <- length(coef(m2)) - length(coef(m1))
m_m2
[1] 1
# Degrees of Freedom  (n obs  - predictors)
anova_m1_vs_m2$Res.Df
[1] 1459 1458
df_m1 <- dim(m1$model)[1] - length(coef(m1))
df_m1
[1] 1459
df_m2 <- dim(m2$model)[1] - length(coef(m2))
df_m2
[1] 1458
# F-Stat
anova_m1_vs_m2$F
[1]         NA 0.05706293
f_stat_m1_vs_m2 <- ((sum(resid(m1)^2) - sum(resid(m2)^2))/m_m2)/((sum(resid(m2)^2)/df_m2))
f_stat_m1_vs_m2
[1] 0.05706293
# p-value for test that additional coefficient (interaction between age and sex) is 0
anova_m1_vs_m2$`Pr(>F)`
[1]        NA 0.8112334
pf(f_stat_m1_vs_m2,df1 = m_m2, df2 = df_m2, lower.tail=F)
[1] 0.8112334
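As a side note, when the larger model adds exactly one coefficient (as m2 does), the F-statistic from anova() equals the square of that coefficient’s t-statistic. Here is a self-contained sketch with simulated data (base R only; none of these objects come from the lab):

```r
# Sketch: for a single added coefficient, F = t^2. Simulated data, base R only.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + rnorm(n)       # x2 truly has no effect on y

small <- lm(y ~ x1)
large <- lm(y ~ x1 + x2)

f_from_anova <- anova(small, large)$F[2]                        # F for adding x2
t_for_x2     <- summary(large)$coefficients["x2", "t value"]    # t-stat on x2
all.equal(f_from_anova, t_for_x2^2)                             # TRUE
```

This is a handy sanity check: the p-value in the anova() table for a one-coefficient comparison will match the p-value on that coefficient in summary().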

Comparing m1 to m3 yields a significant reduction in the amount of unexplained variance. Modeling age with a polynomial term I(age^2) improves our model’s fit.

anova_m1_vs_m3 <- anova(m1,m3)
anova_m1_vs_m3
Analysis of Variance Table

Model 1: support_war01 ~ age + sex + education_n
Model 2: support_war01 ~ age + I(age^2) + sex + education_n
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1   1459 263.41                                  
2   1458 259.79  1    3.6226 20.331 7.036e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Finally, estimating separate polynomials for male and female respondents doesn’t appreciably improve our model’s predictions.

anova_m3_vs_m4 <- anova(m3, m4)
anova_m3_vs_m4
Analysis of Variance Table

Model 1: support_war01 ~ age + I(age^2) + sex + education_n
Model 2: support_war01 ~ (age + I(age^2)) * sex + education_n
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1   1458 259.79                           
2   1456 259.71  2  0.075431 0.2114 0.8094

6 Produce and visualize predicted values for m1 and m3

Now let’s produce and visualize predicted values to help us interpret the relationship between age and support for the war in m1 (our baseline model) and m3 (our polynomial model).

We demonstrated the steps for producing predicted values in Tuesday’s lecture.

To review: you’ll need

  1. Fit a model (already done)
  2. Create a prediction data frame (pred_df)
  • use expand_grid()
  • vary age using seq()
  • hold sex and education constant at typical values
  3. Use this prediction data frame to obtain predicted values from each model and save the output as a new column in pred_df
  • for example pred_df$fit_m1 <- predict(m1, newdata = pred_df)
  4. Visualize the results using ggplot.
# 2. Create a prediction data frame (`pred_df`)
pred_df <- expand_grid(
  age = seq(
    min(df_drww$age, na.rm = T),
    max(df_drww$age, na.rm = T),
    length.out = 20
    ),
  sex = "Male",
  education_n = mean(df_drww$education_n,na.rm=T)
    )


# 3  Use this prediction data frame to obtain predicted values from our model
# Save the output from predict() back into our prediction data frame

pred_df$fit_m1 <- predict(m1, newdata = pred_df)
pred_df$fit_m3 <- predict(m3, newdata = pred_df)

# 4. Visualize the results using ggplot.


pred_df %>%
  ggplot(aes(age, fit_m1))+
  geom_line() +
  ylim(.3,1.3)+
  labs(x = "Age",
       y = "Predicted support for the war in Ukraine",
       title = "Baseline Regression")+
  theme_bw() -> fig_m1

pred_df %>%
  ggplot(aes(age, fit_m3))+
  geom_line()+
  ylim(.3,1.3) +
  labs(x = "Age",
       y = "Predicted support for the war in Ukraine",
       title = "Age Polynomial Regression")+
  theme_bw() -> fig_m3


ggarrange(fig_m1, fig_m3)

m3

Call:
lm(formula = support_war01 ~ age + I(age^2) + sex + education_n, 
    data = df_drww)

Coefficients:
(Intercept)          age     I(age^2)      sexMale  education_n  
 -0.1159257    0.0279593   -0.0001892    0.0860842   -0.0217176  

Please offer a brief interpretation of these two figures.

The left hand panel of the figure shows the predicted values from our baseline model m1. In general, support for the war tends to rise with age, although note that for a male respondent with an average level of education over the age of about 75, our model predicts rates of support greater than 100 percent, which is outside the range of feasible values.

The right hand panel shows the predicted values from m3, which models support for the war as a polynomial function of age, that is:

\[\text{support} = \beta_0 + \beta_1 \text{age} + \beta_2 \text{age}^2 + \beta_3 \text{sex} + \beta_4 \text{education} + \epsilon\] The marginal effect of age in this model varies with the level of age:

\[\frac{\partial \text{support}}{\partial \text{age}}=\beta_1+ 2\times\beta_2\text{age}\]

An increase in age for a 19 year old is associated with a marginal increase in support for the war of about 2 percentage points:

\[\frac{\partial \text{support}}{\partial \text{age}}=0.0279593+2\times-0.0001892\times\text{19} = 0.0207686\]

coef(m3)[2] + 2*coef(m3)[3]*(19)
      age 
0.0207686 

By contrast, an increase in age for an 89 year old is associated with a marginal decrease in support for the war of about 0.6 percentage points.

\[\frac{\partial \text{support}}{\partial \text{age}}=0.0279593+2\times-0.0001892\times\text{89} = -0.005723384\]

coef(m3)[2] + 2*coef(m3)[3]*(89)
         age 
-0.005723384 
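
One more way to read the polynomial: setting the marginal effect to zero gives the age at which predicted support peaks. A quick sketch using the m3 coefficients printed above:

```r
# The marginal effect beta1 + 2*beta2*age equals zero at age* = -beta1 / (2*beta2).
# Coefficients copied from the printed m3 output above.
b1 <- 0.0279593    # coefficient on age
b2 <- -0.0001892   # coefficient on I(age^2)
-b1 / (2 * b2)     # about 73.9: predicted support peaks in the mid-70s
```

This lines up with the figures: the m3 curve bends downward only for the oldest respondents.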

I’d offer some caution, though, in interpreting the decreased levels of support for very old respondents. There are only 50 respondents over the age of 75 in the sample (about 2.7 percent of the overall sample). The general trend is increasing support for the war with age.

sum(df_drww$age >75)
[1] 50
mean(df_drww$age >75)
[1] 0.02767017

7 Produce and visualize predicted values for m2 and m4

Finally, let’s see how to produce and interpret predictions for models m2 and m4.

Recall that in m2 we allowed the coefficient on age to vary by sex, and in m4 our model estimates separate age curves for male and female respondents.

In the code chunk below

  1. Create a new prediction data frame (pred_df_int)
  • use expand_grid()
  • vary age using seq()
  • vary sex = c("Male","Female")
  • hold education constant at a typical value
  2. Use this prediction data frame to obtain predicted values from each model and save the output as a new column in pred_df_int
  • for example: pred_df_int$fit_m2 <- predict(m2, newdata = pred_df_int)
  3. Visualize the results using ggplot.
# 2. Create a prediction data frame (`pred_df_int`)
pred_df_int <- expand_grid(
  age = seq(
    min(df_drww$age, na.rm = T),
    max(df_drww$age, na.rm = T),
    length.out = 20
    ),
  sex = c("Male","Female"),
  education_n = mean(df_drww$education_n,na.rm=T)
    )


# 3. Use this prediction data frame to obtain predicted values from our model
# Save the output from predict() back into our prediction data frame

pred_df_int$fit_m2 <- predict(m2, newdata = pred_df_int)
pred_df_int$fit_m4 <- predict(m4, newdata = pred_df_int)

# 4. Visualize the results using ggplot.


pred_df_int %>%
  ggplot(aes(age, fit_m2, col=sex))+
  geom_line() +
  ylim(.3,1.3)+
  labs(x = "Age",
       y = "Predicted support for the war in Ukraine",
       col = "Sex",
       title = "Age-Sex Interaction Regression")+
  theme_bw() -> fig_m2

pred_df_int %>%
  ggplot(aes(age, fit_m4, col=sex))+
  geom_line()+
  ylim(.25,1.3) +
  labs(x = "Age",
       y = "Predicted support for the war in Ukraine",
       col = "Sex",
       title = "Age Polynomial Regression")+
  theme_bw() -> fig_m4


ggarrange(fig_m2, fig_m4)

Again, please offer a brief interpretation of these two figures.

  • The left hand panel shows the predictions from the interaction model m2, which allows the relationship between age and support for the war to vary by sex. While men tend to be more supportive of the war, and older people tend to be more supportive of the war, there doesn’t appear to be much of a difference in the relationship between age and support by sex. If the coefficient on the interaction term had been positive and statistically significant, the blue line of predicted values for male respondents would have been noticeably steeper. Similarly, if the coefficient on the interaction term had been negative and statistically significant, the slope of the blue line would have been flatter compared to the slope for female respondents.

  • The right hand panel shows the predicted values for m4, which allows the marginal effect of age to vary by age and sex. The predicted values suggest the gap between male and female respondents gets a little larger with age and then closes for very old respondents, but I wouldn’t put too much stock in this model. As the figure below shows, there are only a handful of respondents aged 80 or over in the data, and they drive the downward trend in support. The main point of m4 was to illustrate how predicted values help us interpret complicated models. In practice, unless we have a strong theoretical reason to expect non-linear marginal effects that vary by group, we should probably stick to more parsimonious models.

df_drww %>%
  ggplot(aes(age, support_war01, col = sex))+
  geom_jitter(height = .1)+
  geom_smooth(method ="lm", formula = y ~ poly(x,2))
Warning: Removed 334 rows containing non-finite values (stat_smooth).
Warning: Removed 334 rows containing missing values (geom_point).


8 Explore another relationship in the data (Optional)

Finally, if you’re finished with all the sections above, take some time to explore the data:

  1. Formulate a question you could ask of the data
  2. Estimate a model (or models) that provides insights into this question
  3. Produce a table(s) and figure(s) that helps you interpret your model(s)

For fun, let’s say the group that produces the most compelling combination of question/model/interpretation gets some candy next class.

Some questions you might explore:

  • How do demographic factors predict trust in Russia’s government?
  • Are there systematic differences in who chooses not to answer questions about support for the war or trust in government?
  • Are there noticeable regional differences in support/trust?
  • Should we model education as a numeric scale or as a categorical factor?
  • How does the use of social media (either in general or particular types of social media) predict support for the war or trust in government?

Take the Class Survey

Please take a few moments to complete the class survey for this week.


  1. Mathematically, recall that the slope/first derivative of the line \(y = f(x) = 2x\) is constant \((f'(x) = 2)\): if we increase x by 1, we expect y to increase by 2. The derivative of a parabola \(y = f(z) = z^2\), by contrast, varies with \(z\) \((f'(z) = 2z)\). Going from z = 2 to z = 3 is associated with a greater increase in y than going from z = 1 to z = 2.↩︎

  2. The specific question asked: “Please tell me, do you support or do not support Russia’s military actions on the territory of Ukraine?”↩︎

  3. This is tricky: you need to know what the reference (excluded) category will be.↩︎

  4. Basically, I’m asking whether the coefficient on I(age^2) is statistically significant. If it is, then the change in predicted support for the war among, say, 20 year olds compared to 30 year olds would be different than the change between 30 and 40 year olds. Interpreting polynomial terms and interaction models is much easier if, as we do later, we simply obtain and plot the predicted values from the model.↩︎

  5. As in the previous question, you basically need to look at the table and see whether the coefficients on the interaction terms are statistically significant. It’s a little more complicated than this, but if they are significant, this is evidence of differences across sex.↩︎